SparkContext, SQLContext, and SparkSession are all entry points for interacting with the Spark processing engine:
Setting up a SparkContext involves specifying the Spark master directly or using SparkConf for more detailed configuration.
A SparkSession is created through a builder pattern.
SparkContext manages RDDs directly, whereas SparkSession has an embedded SparkContext that handles those interactions.
SparkContext does not support DataFrames; only SparkSession does.
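A minimal sketch of creating both entry points, assuming a local run (the app name is arbitrary):

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Classic entry point: configure via SparkConf, then create a SparkContext
conf = SparkConf().setAppName("example-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Modern entry point: SparkSession is built with a builder pattern
spark = (SparkSession.builder
         .appName("example-app")
         .master("local[*]")
         .getOrCreate())

# The session wraps a SparkContext internally
print(spark.sparkContext)
```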
In Spark, narrow transformations are those that do not require a shuffle. This improves performance because each partition can be processed independently. filter, map and union are examples of narrow transformations.
On the contrary, operations that do require shuffling data between workers are called wide transformations. They carry a performance penalty in execution time, but they are sometimes unavoidable. join, groupByKey and reduceByKey are operations that require shuffling data.
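A minimal sketch contrasting the two kinds of transformations (the data and app name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").master("local[*]").getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow transformations: each partition is processed independently, no shuffle
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))
filtered = doubled.filter(lambda kv: kv[1] > 2)

# Wide transformation: reduceByKey must bring values with the same key
# to the same partition, which triggers a shuffle
summed = filtered.reduceByKey(lambda a, b: a + b)
print(summed.collect())
```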
Lazy evaluation is a feature of Spark that defers the execution of transformations until an action is called. At that point the Catalyst optimizer (for DataFrames and Datasets) analyzes the transformations and builds the most efficient execution plan.
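A minimal sketch of lazy evaluation with placeholder data: nothing below runs until collect() is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval").master("local[*]").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
squares = evens.map(lambda x: x * x)       # transformation: still nothing runs

# The action finally triggers execution of the whole chain
print(squares.collect())
```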
Serialization: the process of taking an object and transforming it into a stream of bytes so it can be sent across the network or stored more efficiently.
Deserialization: the opposite process, rebuilding objects from a stream of bytes.
These processes are needed because Spark has to share data between the driver and the worker nodes.
In Spark you can find two serializers: the Java serializer and the Kryo serializer.
Using Kryo serialization in PySpark:
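A minimal sketch of enabling Kryo from PySpark; note that Kryo applies to data handled on the JVM side, while Python objects themselves are still pickled:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kryo-example")
         .master("local[*]")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Confirm the serializer that was configured
print(spark.sparkContext.getConf().get("spark.serializer"))
```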
Using custom serialization:
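PySpark also accepts a custom serializer for Python objects when the SparkContext is created, for example MarshalSerializer, which is faster than the default pickle-based serializer but supports fewer Python types. A minimal sketch:

```python
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# Pass the serializer explicitly when building the context
sc = SparkContext("local[*]", "custom-serializer", serializer=MarshalSerializer())

print(sc.parallelize(range(1000)).map(lambda x: x * 2).take(5))
sc.stop()
```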
It depends on how large the dataset is. PySpark is usually faster than plain Python because it can distribute the workload across different nodes and process data in parallel.
However, if the datasets you're working with are small and fit comfortably on a single machine, a plain Python program might be more efficient.
Given the following CSV file:
Consider the following dataframe:
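Neither the CSV file nor the dataframe survived in this copy, so the sketch below stands in for them with a hypothetical people.csv and an equivalent dataframe built in code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").master("local[*]").getOrCreate()

# "people.csv" and its columns are hypothetical placeholders
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.printSchema()

# Equivalent hypothetical dataframe built directly in code
df2 = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
df2.show()
```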
The cache method stores the RDD or DataFrame at the default storage level (memory only for RDDs). The persist method stores it at a level chosen by the user: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.
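A minimal sketch of both methods (the data is a placeholder; MEMORY_AND_DISK is just one possible choice of level):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").master("local[*]").getOrCreate()

# cache() uses the default storage level
df = spark.range(1_000_000)
df.cache()

# persist() lets you pick the storage level explicitly
df2 = spark.range(1_000_000).persist(StorageLevel.MEMORY_AND_DISK)

df.count()   # an action materializes the cached data
df2.count()
```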
A shuffle means that nodes have to send data across the cluster to other nodes in order to perform an operation. These are the wide transformations mentioned above, and moving all this data introduces an overhead in time and resource utilization. In extreme cases it can lead to job failure.
The Catalyst optimizer examines the DAG of the execution plan and proposes a specific order in which to apply the transformations so that the plan uses resources efficiently. It is only available for DataFrames and Datasets.
One way to reduce shuffling is to use narrow transformations whenever possible. Another is to use broadcast joins when one of the tables is small.
I would start by examining the execution plan (df.explain() or the Spark UI) to see which stages or transformations are triggering heavy shuffles or scans.
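For example, a hypothetical aggregation whose plan can be inspected with explain():

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-example").master("local[*]").getOrCreate()

df = (spark.range(1_000_000)
      .withColumn("bucket", F.col("id") % 10)
      .groupBy("bucket")
      .count())

# Prints the parsed/analyzed/optimized logical plans and the physical plan,
# including Exchange operators that mark shuffles
df.explain(True)
```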
Once an action is called, Spark builds a graph from the action backwards, looking at the previous transformations until it arrives at the source datasets. A Spark job is divided into stages, which in turn are collections of tasks. There are narrow stages and wide stages: in narrow stages no shuffle is needed and all the transformations run within each worker, whereas wide stages shuffle data between the partitions.
In map, a function is applied to every element of the RDD.
flatMap applies a function to each element of the RDD and flattens the results.
In mapPartitions, a function is applied to a whole partition instead of each element, for example to compute the average of each partition.
examples:
map
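A minimal sketch with placeholder data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-example").master("local[*]").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# map: one output element per input element
print(rdd.map(lambda x: x * 10).collect())   # [10, 20, 30, 40]
```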
flatMap
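A minimal sketch with placeholder data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatmap-example").master("local[*]").getOrCreate()
lines = spark.sparkContext.parallelize(["hello world", "hi spark"])

# flatMap: the returned iterables are flattened into a single RDD of words
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'hi', 'spark']
```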
mapPartitions
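A minimal sketch computing the average of each partition, with placeholder data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mappartitions-example").master("local[*]").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6], numSlices=2)

# mapPartitions: the function receives an iterator over a whole partition
def partition_average(values):
    values = list(values)
    if values:
        yield sum(values) / len(values)

print(rdd.mapPartitions(partition_average).collect())   # one average per partition
```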
A broadcast join is used when one table is small enough to be sent to every node, so joining it with a bigger table can be done more efficiently, without shuffling the large table. A skewed join is one where the join key is not evenly distributed, overloading some partitions while leaving others with too little work. A sign that a join is skewed is that some partitions have very few items and others far too many. Salting the join key can be used to mitigate a skewed join.
examples:
broadcast
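A minimal sketch with hypothetical orders and countries tables; broadcast() comes from pyspark.sql.functions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").master("local[*]").getOrCreate()

# Hypothetical data: a large fact table and a small dimension table
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "ES", 50.0)], ["order_id", "country", "amount"])
countries = spark.createDataFrame(
    [("US", "United States"), ("ES", "Spain")], ["country", "name"])

# broadcast() hints Spark to ship the small table to every executor,
# so the large table does not need to be shuffled
joined = orders.join(broadcast(countries), on="country", how="left")
joined.show()
```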
salting a skewed join
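A minimal salting sketch with hypothetical data; NUM_SALTS and the column names are arbitrary choices for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salted-join").master("local[*]").getOrCreate()
NUM_SALTS = 8  # assumption: tune to the degree of skew

# Hypothetical skewed fact table (most rows share the "hot" key) and a small lookup table
facts = spark.createDataFrame([("hot", i) for i in range(1000)] + [("cold", 1)], ["key", "value"])
lookup = spark.createDataFrame([("hot", "H"), ("cold", "C")], ["key", "label"])

# Add a random salt to the skewed side so the hot key spreads over several partitions
salted_facts = facts.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate each lookup row once per salt value so every salted key still finds a match
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
salted_lookup = lookup.crossJoin(salts)

joined = salted_facts.join(salted_lookup, on=["key", "salt"]).drop("salt")
print(joined.count())
```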